Integrated analysis toolkit for dissecting whole‐genome‐wide features of cell‐free DNA

Patterns in whole-genome-wide features of cell-free DNA (cfDNA) in human

cfDNA fragment ratios in the blood of gastric cancer patients were higher and more variable (0.07747539-0.63051678) than those in healthy individuals (0.05973131-0.26309905) (Figure 2A). After normalization, the fragment ratio was highly consistent across healthy blood samples but heterogeneous across gastric cancer blood samples (Figure 2B,D,F). Three other cfDNA fragment size features, namely the fraction of GC-corrected fragments, GC-corrected short fragments (100-150 bp) and GC-corrected long fragments (150-220 bp), showed similar patterns (Figure 2C,E,G and Figure S2A-F).
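As an illustration of the fragment size feature, the following minimal Python sketch computes a per-bin ratio of short to long fragments. It is not taken from INAC: the function name `fragment_ratios`, the 5 Mb bin size, the definition of the ratio as short divided by long, and the assignment of 150 bp fragments to the short class are all assumptions made for the example, and no GC correction is applied.

```python
from collections import defaultdict

# Assumed size classes: the text gives 100-150 bp (short) and 150-220 bp (long);
# here a 150 bp fragment is counted as short so the classes do not overlap.
SHORT_RANGE = (100, 150)
LONG_RANGE = (151, 220)


def fragment_ratios(fragments, bin_size=5_000_000):
    """Return {(chrom, bin_index): short/long ratio} for each genomic bin.

    fragments: iterable of (chrom, start, length) tuples.
    bin_size:  hypothetical bin width in bp (an assumption, not from the paper).
    Bins without any long fragment are skipped to avoid division by zero.
    """
    counts = defaultdict(lambda: [0, 0])  # [short count, long count]
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            counts[key][0] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            counts[key][1] += 1
        # fragments outside both ranges are ignored
    return {
        key: short_n / long_n
        for key, (short_n, long_n) in counts.items()
        if long_n > 0
    }
```

A sample-level summary such as the range reported above could then be taken over the per-bin values, again only under these assumptions.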
In healthy individuals, the copy number variation (CNV) scores of most bins were concentrated around zero, whereas the variance of the CNV scores was larger in patients with gastric cancer (Figure S3A,B and Figure S4A, p = 0.023). Significantly amplified CNV genes (n = 332) in gastric cancer patients were then used to explore tumour-derived programs (Figure S3C,D). Specifically, INAC identified the amplified CNV genes in tumour-derived programs (Figure S4B).
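The comparison of CNV score spread can be sketched as follows. The text does not define how the CNV score is computed, so this example assumes a per-bin log2 ratio of sample coverage to a healthy reference coverage; the function names `cnv_scores` and `cnv_score_variance` are hypothetical and do not correspond to INAC functions.

```python
import math
import statistics


def cnv_scores(sample_cov, reference_cov):
    """Per-bin log2 ratio of sample coverage to reference coverage.

    Assumed definition: a score near zero means the bin matches the reference.
    All coverage values must be positive, otherwise math.log2 raises an error.
    """
    if len(sample_cov) != len(reference_cov):
        raise ValueError("sample and reference must cover the same bins")
    return [math.log2(s / r) for s, r in zip(sample_cov, reference_cov)]


def cnv_score_variance(scores):
    """Population variance of the per-bin scores for one sample."""
    return statistics.pvariance(scores)
```

Under this assumed definition, a sample whose scores cluster around zero yields a small variance, and a sample with amplified or deleted bins yields a larger one.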
To explore the relationship between promoter region cfDNA coverage and the corresponding chromatin state, cfDNA conventional relative coverage, cfDNA TSS NDR relative coverage and TSS 2K region relative coverage were used to estimate the promoter chromatin state (Figure 3A and Figure S5A). Genes with both TSS NDR relative coverage and TSS 2K region relative coverage less than 1 in more gastric cancer patients were considered to possess an open chromatin state (Figure 3B; red dots indicate permissive genes with an open TSS chromatin state), whereas genes with both TSS NDR relative coverage and TSS 2K region relative coverage greater than 1.5 in more patients were regarded as possessing a closed chromatin state (Figure 3C; yellow dots indicate nonpermissive genes with a closed TSS chromatin state). TCGA STAD RNA-seq datasets were used to confirm that the permissive genes had significantly higher gene expression levels than the nonpermissive genes (Figure 3D). The upregulated genes in the TCGA STAD RNA-seq datasets had significantly lower TSS NDR relative coverage and TSS 2K region relative coverage than the downregulated genes (Figure 3E-G). In addition, the TSS NDR relative coverage of healthy individuals was inversely related to the RNA expression levels of PBMCs (Figure S5B). Furthermore, 794 of the top 1000 expressed genes in TCGA STAD RNA-seq were predicted to be expressed by TSS NDR relative coverage and TSS 2K region relative coverage in patients with gastric cancer (Figure S5C). GTEx PBMC RNA-seq analysis revealed that 857 genes were predicted to be expressed by TSS NDR relative coverage and TSS 2K region relative coverage in healthy samples (Figure S5D).
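The thresholding rule described above can be sketched in Python as follows. The cut-offs of 1 and 1.5 come from the text; the function name `classify_promoter` and the requirement that more than half of the patients satisfy a condition are assumptions, since the text does not state the exact patient fraction used.

```python
def classify_promoter(ndr_cov, tss2k_cov, open_cut=1.0, closed_cut=1.5,
                      min_fraction=0.5):
    """Classify one gene's promoter as 'open', 'closed' or 'undetermined'.

    ndr_cov, tss2k_cov: per-patient TSS NDR and TSS 2K relative coverage.
    min_fraction: assumed share of patients that must agree (hypothetical).
    """
    if len(ndr_cov) != len(tss2k_cov) or not ndr_cov:
        raise ValueError("need one value per patient in both lists")
    n = len(ndr_cov)
    pairs = list(zip(ndr_cov, tss2k_cov))
    # Both coverages below the open cut-off: nucleosome-depleted, permissive.
    open_n = sum(1 for a, b in pairs if a < open_cut and b < open_cut)
    # Both coverages above the closed cut-off: nucleosome-occupied.
    closed_n = sum(1 for a, b in pairs if a > closed_cut and b > closed_cut)
    if open_n / n > min_fraction:
        return "open"
    if closed_n / n > min_fraction:
        return "closed"
    return "undetermined"
```

Genes labelled "open" correspond to the permissive genes and genes labelled "closed" to the nonpermissive genes in the description above.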
Consistent with previous conclusions, the upregulated genes with high PFE had higher gene expression levels than the downregulated genes with low PFE (p = 1.4e-8) 7 (Figure S6A-C). The upregulated genes in the TCGA STAD RNA-seq datasets had a higher mean PFE than the downregulated genes in both the gastric cancer group and the healthy group (Figure S6E, p < 2.2e-16 in the gastric cancer group, p < 2.2e-16 in the healthy group).
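For intuition, an entropy-based fragmentation score can be sketched as below, assuming that PFE refers to promoter fragmentation entropy as used in earlier cfDNA work. This is a simplified Shannon entropy of the fragment length distribution at one promoter; it omits any normalization that a full PFE calculation may apply, and the function name `fragment_entropy` is hypothetical.

```python
import math
from collections import Counter


def fragment_entropy(fragment_lengths):
    """Shannon entropy (in bits) of the fragment length distribution.

    fragment_lengths: lengths (bp) of cfDNA fragments overlapping one promoter.
    Higher values indicate more diverse fragment sizes at that promoter.
    """
    if not fragment_lengths:
        raise ValueError("at least one fragment is required")
    total = len(fragment_lengths)
    counts = Counter(fragment_lengths)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A promoter covered only by fragments of one length has an entropy of zero, and the value grows as fragment sizes become more evenly spread over many lengths.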
These features achieved good performance in the training and test datasets (Figure 4A-C). Using all features, a stochastic gradient descent algorithm achieved an area under the curve (AUC) of 0.91 (Figure S7B). However, 100 samples (50 patients with cancer and 50 healthy individuals) cannot confirm the robustness and effectiveness of INAC; a larger cohort is needed to estimate its performance. In addition, most of the gastric cancer samples and healthy samples were consistently classified correctly by the six cfDNA features. In samples where more than two cfDNA features predicted the wrong label, the other cfDNA features could still indicate the true label (Figure S7A). We also generated transcription factor (TF) nucleosome occupancy maps for our cohort; these TFs could also distinguish patients with cancer from healthy individuals (Figure S7C,D). Furthermore, INAC supports hundreds of machine learning methods and estimates test prediction accuracy through 10-fold cross-validation. The six cfDNA features achieved their best AUCs with different methods (Figure S8A-F).
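The AUC reported above can be illustrated with a small, dependency-free Python sketch. It computes the AUC as the probability that a randomly chosen cancer sample receives a higher classifier score than a randomly chosen healthy sample, with ties counted as one half. The function name `roc_auc` and the label coding (1 for cancer, 0 for healthy) are assumptions for the example; this is not the INAC implementation and it does not perform model training or cross-validation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for cancer, 0 for healthy (assumed coding).
    scores: classifier outputs, higher meaning more cancer-like.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

In a 10-fold cross-validation setting, such a function would be applied to the held-out scores of each fold and the results averaged; the quadratic pairwise loop is acceptable for a cohort of about 100 samples.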
In this study, INAC was shown to be able to assess whole-genome-wide features of plasma cfDNA.

AUTHOR CONTRIBUTIONS
Jie Li, Zhaode Bu, Jiafu Ji and Xun Lan conceived this project. Jie Li and Jiahui Chen collected the patient blood and healthy blood. Jie Li performed bioinformatics analysis. Xin Sun performed the experiments. Jie Li and Xun Lan wrote the manuscript. All authors read or provided comments on the manuscript.

ACKNOWLEDGEMENTS
This work was partially supported by grants (81972680 to X. L.) from the National Natural Science Foundation of China, and a start-up fund from Tsinghua University-Peking University Joint Center for Life Science.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The sequencing data have been deposited in the Genome Sequence Archive under project PRJCA013939 (https://ngdc.cncb.ac.cn/gsahuman/browse/HRA003821). The processed data for fragment size, CNV, TSS, PFE and TF have been deposited in OMIX002911 (https://ngdc.cncb.ac.cn/omix/view/OMIX002911). Code has been deposited on GitHub: https://github.com/jacklee2thu/INAC.